class: center, middle, inverse, title-slide

.title[
# The Chi Square Test and Measures of Association
]
.subtitle[
## EDP 613
]
.author[
### Week 12
]

---
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
<script type="text/x-mathjax-config">
  MathJax.Hub.Register.StartupHook("TeX Jax Ready",function () {
    MathJax.Hub.Insert(MathJax.InputJax.TeX.Definitions.macros,{
      cancel: ["Extension","cancel"],
      bcancel: ["Extension","cancel"],
      xcancel: ["Extension","cancel"],
      cancelto: ["Extension","cancel"]
    });
  });
</script>
# A Note About The Slides

Currently, the equations do not display properly in Firefox. Other browsers, such as Chrome and Safari, do work.

---

# Independence

Two variables that have no association with each other are **statistically independent**.

---

# Frequencies

--

- **expected frequencies**

> written `\(f_e\)`

--

> what you would *expect* in a bivariate table if two variables were statistically independent

--

> only assumption: the null hypothesis is true

--

> calculated by `$$f_e = \dfrac{\text{column marginal}\cdot\text{row marginal}}{\text{total sample size}}$$`

--

- **observed frequencies**

--

> written `\(f_o\)`

--

> what you would *observe* in a bivariate table given what you have

--

> calculated by you or given

---

# Chi-Square Test

--

> written `\(\chi^2\)`

--

> assumes *random sampling*

--

> is an inferential test for a significant relationship between two variables

--

> calculated by `$$\chi^2 =\sum\dfrac{(f_o-f_e)^2}{f_e}$$`

--

> with `$$df = (r-1)(c-1)$$`

---

### Example: Social Media

The number of respondents using at least one social media outlet is given below by age group

.footnote[Source: [*Pew Research Center: Social Media Fact Sheet*](https://www.pewresearch.org/internet/fact-sheet/social-media/)]

.pull-left[
<center>In 2011:</center>
<br>
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> 820 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> 590 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td style="text-align:center;"> 360 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> 120 </td>
  </tr>
</tbody>
</table>
]

--

.pull-right[
<center>In 2021:</center>
<br>
<table class="table" style="width: auto !important; 
margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> `840` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> `810` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td style="text-align:center;"> `730` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> `450` </td>
  </tr>
</tbody>
</table>
]

--

a. Test the assumption that *users are equally likely* to be in each of the four age groups listed.

b. Which age group contributes the largest amount to the test statistic?

---

### Example: Solution for 2011

a. We have

--

`$$H_0: \text{Users are equally likely to be in each of the four groups listed}$$`

--

`$$H_1: \text{Users are NOT equally likely to be in each of the four groups listed}$$`

--

> Step 1: Find `\(N\)`

<br>
<br>

--

.pull-left[
We have `\(820+590+360+120=1890\)` total responses
]

--

.pull-right[
If the distribution were uniform across all four categories, we would expect each to have `\(1890/4\approx472\)` respondents
]

---

> Step 2: Calculate the `\(\chi^2\)` statistic

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
   <th style="text-align:center;"> `\chi^2` </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> `820` </td>
   <td style="text-align:center;"> `\frac{(820-472)^2}{472} \approx 256.576` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> `590` </td>
   <td style="text-align:center;"> `\frac{(590-472)^2}{472} \approx 29.500` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td style="text-align:center;"> `360` </td>
   <td 
style="text-align:center;"> `\frac{(360-472)^2}{472} \approx 26.576` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> `120` </td>
   <td style="text-align:center;"> `\frac{(120-472)^2}{472} \approx 262.508` </td>
  </tr>
</tbody>
</table>

--

with the total `$$256.576 + 29.500 + 26.576 + 262.508 = 575.160$$`

--

and `$$df = 4-1 = 3$$`

---

> Step 3: Make a Decision

--

In Appendix D

- Look at `\(df = 3\)`

--

- `\(\chi^2=575.160\)` exceeds the largest critical value in that row, so `\(p<0.001\)`

--

- We reject `\(H_0\)`, implying that

--

<center> <i> respondents are not equally likely to be in each of the four age ranges listed</i> </center>

---

b.

- 65+ contributes the greatest amount to the sum for the test statistic

--

- The observed count is much smaller than expected

---

### Example: Solution for 2021

We have

`$$H_0: \text{Users are equally likely to be in each of the four groups listed}$$`

`$$H_1: \text{Users are NOT equally likely to be in each of the four groups listed}$$`

---

> Step 1: Find `\(N\)`

<br>
<br>

--

.pull-left[
We have `\(840+810+730+450=2830\)` total responses
]

--

.pull-right[
If the distribution were uniform across all four categories, we would expect each to have `\(2830/4\approx707\)` respondents
]

---

> Step 2: Calculate the `\(\chi^2\)` statistic

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
   <th style="text-align:center;"> `\chi^2` </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> `840` </td>
   <td style="text-align:center;"> `\frac{(840-707)^2}{707} \approx 25.020` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> `810` </td>
   <td style="text-align:center;"> `\frac{(810-707)^2}{707} \approx 15.006` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td 
style="text-align:center;"> `730` </td>
   <td style="text-align:center;"> `\frac{(730-707)^2}{707} \approx 0.748` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> `450` </td>
   <td style="text-align:center;"> `\frac{(450-707)^2}{707} \approx 93.421` </td>
  </tr>
</tbody>
</table>

with the total `$$25.020+15.006+0.748+93.421 = 134.195$$`

and `$$df = 4-1 = 3$$`

---

> Step 3: Make a Decision

--

In Appendix D

- Look at `\(df = 3\)`
- `\(\chi^2=134.195\)` exceeds the largest critical value in that row, so `\(p<0.001\)`

--

- We reject `\(H_0\)`, implying that

<center> <i> respondents are not equally likely to be in each of the four age ranges listed</i> </center>

---

## That's it. Take a break before our R session!
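---

### Example: Checking the Arithmetic

The hand calculations in the social media example can be double-checked with a few lines of code. The block below is a minimal sketch in Python. The choice of Python and the helper name `chi_square_gof` are assumptions made for illustration and do not come from the slides. It mirrors Step 1 by using the whole-number expected counts (472 and 707), then reports each age group's contribution to `\(\chi^2\)`, the total, and the group with the largest contribution.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit against one common expected count.

    Returns (contributions, total), where each contribution is
    (f_o - f_e)^2 / f_e for one category.
    """
    contributions = [(f_o - expected) ** 2 / expected for f_o in observed]
    return contributions, sum(contributions)


AGE_GROUPS = ["18 - 29", "30 - 49", "50 - 64", "65+"]
RESPONSES = {
    2011: [820, 590, 360, 120],
    2021: [840, 810, 730, 450],
}

for year, observed in RESPONSES.items():
    n = sum(observed)
    # Assumption: follow the slides, which use whole-number expected counts
    # (1890/4 -> 472 and 2830/4 -> 707) rather than 472.5 and 707.5.
    expected = n // len(observed)
    contributions, chi_sq = chi_square_gof(observed, expected)
    largest = AGE_GROUPS[contributions.index(max(contributions))]

    print(f"{year}: N = {n}, f_e = {expected}, df = {len(observed) - 1}")
    for group, part in zip(AGE_GROUPS, contributions):
        print(f"  {group:>7}: {part:8.3f}")
    print(f"  chi-square = {chi_sq:.3f}; largest contribution: {largest}")
```

A total computed this way can differ from a hand total in the last decimal place, because the hand total adds terms that were already rounded.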